Fair Evaluation of Global Network Aligners
Biological network alignment identifies topologically and functionally
conserved regions between networks of different species. It encompasses two
algorithmic steps: node cost function (NCF), which measures similarities
between nodes in different networks, and alignment strategy (AS), which uses
these similarities to rapidly identify high-scoring alignments. Different
methods use both different NCFs and different ASs. Thus, it is unclear whether
the superiority of a method comes from its NCF, its AS, or both. We already
showed on MI-GRAAL and IsoRankN that combining NCF of one method and AS of
another method can lead to a new superior method. Here, we evaluate MI-GRAAL
against the newer GHOST method, aiming to further improve alignment quality. Also, we
approach several important questions that have not been asked systematically
thus far. First, we ask how much of the node similarity information in NCF
should come from sequence data compared to topology data. Existing methods
determine this more or less arbitrarily, which could affect the resulting
alignment(s). Second, when topology is used in NCF, we ask how large the size
of the neighborhoods of the compared nodes should be. Existing methods assume
that larger neighborhood sizes are better.
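The NCF/AS decomposition is what makes such mix-and-match evaluation possible: an aligner is simply an AS applied to the node-similarity scores that an NCF produces, so the components of different methods can be recombined freely. Below is a minimal sketch of this modularity, using a toy degree-based NCF and a greedy AS; both are hypothetical stand-ins, not MI-GRAAL's or GHOST's actual components.

```python
# A minimal sketch of NCF/AS modularity (hypothetical toy components,
# not MI-GRAAL's or GHOST's actual NCF or AS).
from itertools import product

def ncf_degree(g1, g2):
    """Toy topology-based NCF: similarity from relative degree difference."""
    deg1 = {u: len(nbrs) for u, nbrs in g1.items()}
    deg2 = {v: len(nbrs) for v, nbrs in g2.items()}
    return {(u, v): 1.0 - abs(deg1[u] - deg2[v]) / max(deg1[u], deg2[v], 1)
            for u, v in product(g1, g2)}

def as_greedy(sim):
    """Toy AS: repeatedly match the highest-scoring unmatched node pair."""
    alignment, used1, used2 = {}, set(), set()
    for (u, v), _ in sorted(sim.items(), key=lambda kv: -kv[1]):
        if u not in used1 and v not in used2:
            alignment[u] = v
            used1.add(u)
            used2.add(v)
    return alignment

# Combining the NCF of one method with the AS of another is just composition:
g1 = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
g2 = {"x": ["y"], "y": ["x", "z"], "z": ["y"]}
print(as_greedy(ncf_degree(g1, g2)))
```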
We find that MI-GRAAL's NCF is superior to GHOST's NCF, while the performance
of the methods' ASs is data-dependent. Thus, the combination of MI-GRAAL's NCF
and GHOST's AS could be a new superior method for certain data. Also, the
amount of sequence information used within NCF does not affect alignment
quality, while the inclusion of topological information is crucial. Finally,
larger neighborhood sizes are preferred, but often, it is the second largest
size that is superior, and using this size would decrease computational
complexity.
Together, our results give several general recommendations for a fair
evaluation of network alignment methods.
Comment: 19 pages, 10 figures. Presented at the 2014 ISMB Conference, July 13-15, Boston, MA.
Efficient Construction of Probabilistic Tree Embeddings
In this paper we describe an algorithm that embeds a graph metric
$(V, d_G)$ on an undirected weighted graph $G = (V, E)$ into a distribution of
tree metrics $(T, d_T)$ such that for every pair $u, v \in V$,
$d_G(u, v) \le d_T(u, v)$ and $\mathbb{E}[d_T(u, v)] \le O(\log n) \cdot d_G(u, v)$.
Such embeddings have proved highly useful in designing fast approximation
algorithms, as many hard problems on graphs are easy to solve on tree
instances. For a graph with $n$ vertices and $m$ edges, our algorithm runs in
$O(m \log n)$ time with high probability, which improves the previous upper
bound of $O(m \log^3 n)$ shown by Mendel et al. in 2009.
The key component of our algorithm is a new approximate single-source
shortest-path algorithm, which implements the priority queue with a new data
structure, the "bucket-tree structure". The algorithm has three properties: it
only requires linear time in the number of edges in the input graph; the
computed distances have a distance-preserving property; and when computing the
shortest paths to the $k$-nearest vertices from the source, it only needs to
visit these vertices and their edge lists. These properties are essential to
guarantee the correctness and the stated time bound.
Using this shortest-path algorithm, we show how to generate an intermediate
structure, the approximate dominance sequences of the input graph, in
$O(m \log n)$ time, and further propose a simple yet efficient algorithm to
convert this sequence to a tree embedding in $O(n \log n)$ time, both with high
probability. Combining the three subroutines gives the stated time bound of the
algorithm.
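For intuition, the classic FRT construction that such embeddings rely on can be stated compactly: draw a random vertex permutation and a random radius scale, then at each distance scale assign every vertex to the first permuted vertex within the scaled radius; each vertex's chain of cluster assignments defines its leaf-to-root path in the tree. A minimal sequential sketch follows; it runs in roughly $O(n^2 \log \Delta)$ time and does not attempt the paper's $O(m \log n)$ construction.

```python
# A minimal sequential sketch of an FRT-style embedding (random permutation
# plus random radius scale); it does not attempt the paper's O(m log n)
# construction.
import math
import random
from itertools import combinations

def frt_cluster_chains(dist, vertices):
    """dist: dict[(u, v)] -> metric distance. Returns each vertex's chain of
    cluster centers from the coarsest level down to level 0."""
    def d(u, v):
        return 0 if u == v else dist[min(u, v), max(u, v)]

    diameter = max(d(u, v) for u, v in combinations(vertices, 2))
    levels = int(math.ceil(math.log2(diameter))) + 1
    pi = list(vertices)
    random.shuffle(pi)                # random vertex permutation
    beta = random.uniform(1, 2)       # random radius scale
    chains = {}
    for v in vertices:
        # at level i, v joins the first center (in pi order) within beta * 2^i
        chains[v] = tuple(next(c for c in pi if d(v, c) <= beta * 2 ** i)
                          for i in range(levels, -1, -1))
    return chains  # vertices sharing a chain prefix share an FRT-tree ancestor

verts = ["a", "b", "c", "d"]
dist = {("a", "b"): 1, ("a", "c"): 2, ("a", "d"): 4,
        ("b", "c"): 1, ("b", "d"): 3, ("c", "d"): 2}
print(frt_cluster_chains(dist, verts))
```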
Then we show that this efficient construction can facilitate some
applications. We prove that FRT trees (the generated tree embeddings) are
Ramsey partitions with an asymptotically tight bound, so the construction of a
series of distance oracles can be accelerated.
Multi-task Representation Learning for Pure Exploration in Linear Bandits
Despite the recent success of representation learning in sequential decision
making, the study of the pure exploration scenario (i.e., identify the best
option and minimize the sample complexity) is still limited. In this paper, we
study multi-task representation learning for best arm identification in linear
bandits (RepBAI-LB) and best policy identification in contextual linear bandits
(RepBPI-CLB), two popular pure exploration settings with wide applications,
e.g., clinical trials and web content optimization. In these two problems, all
tasks share a common low-dimensional linear representation, and our goal is to
leverage this feature to accelerate the best arm (policy) identification
process for all tasks. For these problems, we design computationally and sample
efficient algorithms DouExpDes and C-DouExpDes, which perform double
experimental designs to plan optimal sample allocations for learning the global
representation. We show that by learning the common representation among tasks,
our sample complexity is significantly better than that of the naive approach
which solves tasks independently. To the best of our knowledge, this is the
first work to demonstrate the benefits of representation learning for
multi-task pure exploration.
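As a concrete point of reference, a standard building block in such pure-exploration algorithms is an experimental design that decides how often to pull each arm before estimating the unknown parameter. The sketch below computes an approximate G-optimal design with Frank-Wolfe updates and then fits a least-squares estimate; it is a generic ingredient, not DouExpDes or C-DouExpDes themselves, and all names and sizes in it are illustrative.

```python
# A minimal sketch of a generic experimental-design ingredient (not DouExpDes
# or C-DouExpDes): an approximate G-optimal design over arm features via
# Frank-Wolfe updates, followed by least-squares estimation.
import numpy as np

def g_optimal_design(X, iters=500):
    """X: (K, d) arm feature matrix. Returns an allocation over the K arms."""
    K, _ = X.shape
    lam = np.full(K, 1.0 / K)
    for t in range(iters):
        A = X.T @ (lam[:, None] * X)               # sum_i lam_i x_i x_i^T
        A_inv = np.linalg.pinv(A)
        g = np.einsum("ij,jk,ik->i", X, A_inv, X)  # x_i^T A^{-1} x_i
        k = int(np.argmax(g))                      # most under-covered arm
        step = 2.0 / (t + 2)                       # Frank-Wolfe step size
        lam = (1 - step) * lam
        lam[k] += step
    return lam

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))          # 20 arms in R^4 (illustrative)
lam = g_optimal_design(X)

# Sample arms according to the design, observe noisy linear rewards,
# and recover the hidden parameter by least squares.
n = 2000
counts = rng.multinomial(n, lam)
theta_true = rng.normal(size=4)
X_samples = np.repeat(X, counts, axis=0)
y = X_samples @ theta_true + rng.normal(scale=0.1, size=len(X_samples))
theta_hat = np.linalg.lstsq(X_samples, y, rcond=None)[0]
print("estimation error:", np.linalg.norm(theta_hat - theta_true))
```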
Deconvolution approach for floating wind turbines
Floating offshore wind turbines (FOWTs), a crucial component of the modern offshore wind energy industry, produce green renewable energy. Accurately evaluating excessive loads while a FOWT operates in adverse weather conditions is a safety concern: under certain sea states, dangerous structural bending moments may cause operational problems. In this study, hydrodynamic ambient wave loads were calculated using the commercial FAST software and converted into FOWT structural loads. This article proposes a Monte Carlo-based engineering technique that is computationally efficient for predicting extreme statistics of either the load or the response process, based on simulations or observations. The novel deconvolution technique is explained in detail. The proposed approach makes efficient use of the entire dataset to produce a simple yet accurate estimate of extreme response values and fatigue life. Extreme values estimated with the new deconvolution approach were compared with the same values produced by the modified Weibull technique. Based on the overall performance of the deconvolution approach under environmental wave loading, it is expected to offer a reliable and accurate forecast of extreme structural loads.
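To make the comparison baseline concrete: a Weibull-style tail method fits a parametric distribution to exceedances of the simulated load process over a high threshold and extrapolates a return level from the fit. The sketch below illustrates that generic idea on synthetic data; it is not the paper's deconvolution method or its exact modified Weibull variant, and the stand-in load process, threshold, and sample sizes are illustrative assumptions.

```python
# A minimal sketch of a Weibull-tail extreme-value baseline on synthetic data;
# not the paper's deconvolution method or its exact modified Weibull variant.
import numpy as np
from scipy.stats import weibull_min

rng = np.random.default_rng(1)
load = rng.gumbel(loc=5.0, scale=1.0, size=100_000)  # stand-in for simulated bending moments

u = np.quantile(load, 0.95)              # high tail threshold
exceed = load[load > u] - u              # exceedances over the threshold
c, _, scale = weibull_min.fit(exceed, floc=0)  # Weibull fit to the tail

# Extrapolate the level exceeded on average once per N samples,
# using the conditional tail distribution above the threshold.
N = 10_000_000
p_exceed = exceed.size / load.size
level = u + weibull_min.ppf(1 - 1 / (N * p_exceed), c, loc=0, scale=scale)
print(f"threshold={u:.2f}, extrapolated 1-in-{N} level={level:.2f}")
```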
Instruction Mining: When Data Mining Meets Large Language Model Finetuning
Large language models (LLMs) are initially pretrained for broad capabilities
and then finetuned with instruction-following datasets to improve their
performance in interacting with humans. Despite advances in finetuning, a
standardized guideline for selecting high-quality datasets to optimize this
process remains elusive. In this paper, we first propose InstructMining, an
innovative method designed for automatically selecting premium
instruction-following data for finetuning LLMs. Specifically, InstructMining
utilizes natural language indicators as a measure of data quality, applying
them to evaluate unseen datasets. During experimentation, we discover that a
double descent phenomenon exists in large language model finetuning. Based on
this observation, we further leverage BlendSearch to help find the best subset
among the entire dataset (i.e., 2,532 out of 100,000). Experiment results show
that InstructMining-7B achieves state-of-the-art performance on two of the most
popular benchmarks: LLM-as-a-judge and Huggingface OpenLLM leaderboard.
Comment: 22 pages, 7 figures.
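The selection loop described above can be pictured as: score every candidate example with cheap quality indicators, rank, then search over how many of the top-ranked examples to keep. The sketch below mimics that pipeline with synthetic indicators, a linear quality estimator, and a plain sweep standing in for BlendSearch; every indicator, weight, and function here is a hypothetical stand-in, not InstructMining's actual rule.

```python
# A minimal sketch of indicator-based data selection with synthetic data;
# the indicators, the linear quality rule, and the plain sweep standing in
# for BlendSearch are all hypothetical, not InstructMining's actual design.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
# stand-in quality indicators per example (e.g., length, perplexity, reward)
indicators = rng.normal(size=(n, 3))
true_quality = indicators @ np.array([0.5, -1.0, 0.8]) + rng.normal(scale=0.3, size=n)

# fit a linear quality estimator on the indicators, then rank all examples
w = np.linalg.lstsq(indicators, true_quality, rcond=None)[0]
order = np.argsort(-(indicators @ w))     # best-scored examples first

def proxy_finetune_loss(subset_size):
    """Toy stand-in for 'finetune on the top-k and evaluate': higher average
    quality helps, but too small a subset hurts."""
    top = true_quality[order[:subset_size]]
    return -top.mean() + 50.0 / subset_size

# plain sweep over subset sizes where the paper uses BlendSearch
sizes = np.logspace(1.5, 4, 20).astype(int)
best = min(sizes, key=proxy_finetune_loss)
print(f"selected subset size: {best} of {n}")
```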
Efficient Parallel Output-Sensitive Edit Distance
Given two strings $A[1..n]$ and $B[1..m]$, and a set of operations allowed to
edit the strings, the edit distance between $A$ and $B$ is the minimum number
of operations required to transform $A$ into $B$. Sequentially, a standard
Dynamic Programming (DP) algorithm solves edit distance with $\Theta(nm)$ cost.
In many real-world applications, the strings to be compared are similar and
have small edit distances. To achieve highly practical implementations, we
focus on output-sensitive parallel edit-distance algorithms, i.e., to achieve
asymptotically better cost bounds than the standard $\Theta(nm)$ algorithm when
the edit distance is small. We study four algorithms in the paper, including
three algorithms based on Breadth-First Search (BFS) and one algorithm based on
Divide-and-Conquer (DaC). Our BFS-based solution is based on the Landau-Vishkin
algorithm. We implement three different data structures for the longest common
prefix (LCP) queries needed in the algorithm: the classic solution using
parallel suffix array, and two hash-based solutions proposed in this paper. Our
DaC-based solution is inspired by the output-insensitive solution proposed by
Apostolico et al., and we propose a non-trivial adaptation to make it
output-sensitive. All our algorithms have good theoretical guarantees, and they
achieve different tradeoffs between work (total number of operations), span
(longest dependence chain in the computation), and space.
We test and compare our algorithms on both synthetic data and real-world
data. Our BFS-based algorithms outperform the existing parallel edit-distance
implementation in ParlayLib in all test cases. By comparing our algorithms, we
also provide a better understanding of the choice of algorithms for different
input patterns. We believe that our paper is the first systematic study in the
theory and practice of parallel edit distance.
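For readers unfamiliar with the Landau-Vishkin idea underlying the BFS-based algorithms: process candidate distances d = 0, 1, 2, ... and, for each diagonal of the DP table, keep only the furthest-reaching cell, extending it along matching characters with an LCP query. A minimal sequential sketch with a naive linear-scan LCP is below; this is the classic algorithm, not the paper's parallel variants, and the naive scan is exactly what the paper replaces with suffix-array or hash-based structures.

```python
# A minimal sequential sketch of the Landau-Vishkin scheme (the classic
# algorithm, not the paper's parallel variants): round d explores all cells
# of edit distance d, keeping one furthest-reaching row per diagonal and
# extending along matches with an LCP query.
def edit_distance_lv(a: str, b: str) -> int:
    n, m = len(a), len(b)

    def lcp(i, j):  # length of longest common prefix of a[i:] and b[j:]
        k = 0
        while i + k < n and j + k < m and a[i + k] == b[j + k]:
            k += 1
        return k

    NEG = float("-inf")
    L = {0: lcp(0, 0)}  # L[k]: furthest row reached on diagonal k = j - i
    d = 0
    while L.get(m - n, NEG) < n:   # done when diagonal m - n reaches row n
        d += 1
        nxt = {}
        for k in range(-min(d, n), min(d, m) + 1):
            i = max(L.get(k, NEG) + 1,      # substitution (same diagonal)
                    L.get(k - 1, NEG),      # insertion (advance b only)
                    L.get(k + 1, NEG) + 1)  # deletion (advance a only)
            if i < 0:
                continue
            i = min(i, n, m - k)            # clamp to the DP table
            nxt[k] = i + lcp(i, i + k)      # slide along matching characters
        L = nxt
    return d

print(edit_distance_lv("kitten", "sitting"))  # 3
```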
Parallel Longest Increasing Subsequence and van Emde Boas Trees
This paper studies parallel algorithms for the longest increasing subsequence
(LIS) problem. Let $n$ be the input size and $k$ be the LIS length of the
input. Sequentially, LIS is a simple problem that can be solved using dynamic
programming (DP) in $O(n \log n)$ work. However, parallelizing LIS is a
long-standing challenge. We are unaware of any parallel LIS algorithm that has
optimal work and non-trivial parallelism (i.e., $\tilde{O}(k)$ or $o(n)$
span).
This paper proposes a parallel LIS algorithm that costs $O(n \log k)$ work,
$\tilde{O}(k)$ span, and $O(n)$ space, and is much simpler than the previous
parallel LIS algorithms. We also generalize the algorithm to a weighted version
of LIS, which maximizes the weighted sum for all objects in an increasing
subsequence. To achieve a better work bound for the weighted LIS algorithm, we
designed parallel algorithms for the van Emde Boas (vEB) tree, which has the
same structure as the sequential vEB tree, and supports work-efficient parallel
batch insertion, deletion, and range queries.
We also implemented our parallel LIS algorithms. Our implementation is
light-weighted, efficient, and scalable. On input size $10^9$, our LIS
algorithm outperforms a highly-optimized sequential algorithm (with
$O(n \log k)$ cost) on inputs with $k \le 3 \times 10^5$. Our algorithm is also much faster
than the best existing parallel implementation by Shen et al. (2022) on all
input instances.
Comment: To be published in Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '23).
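For context, the sequential $O(n \log k)$ computation that the paper parallelizes can be written in a few lines: maintain, for each length, the smallest possible tail of an increasing subsequence of that length, and place each element with one binary search. A minimal sketch is below; the parallel algorithm and the batch vEB-tree operations are well beyond this and are not shown.

```python
# A minimal sequential sketch of the O(n log k) LIS computation that the
# paper parallelizes. tails[i] is the smallest possible tail of an
# increasing subsequence of length i + 1, so tails stays sorted and each
# element is placed with one binary search over at most k entries.
from bisect import bisect_left

def lis_length(xs):
    tails = []
    for x in xs:
        i = bisect_left(tails, x)   # first tail >= x
        if i == len(tails):
            tails.append(x)         # x extends the longest subsequence
        else:
            tails[i] = x            # x yields a smaller tail for length i + 1
    return len(tails)

print(lis_length([3, 1, 4, 1, 5, 9, 2, 6]))  # 4, e.g. (3, 4, 5, 9)
```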